Principal Component Analysis

Introduction:

Analysis was conducted on the lemurs dataset to determine relationships that are not inherently apparent upon a basic/elementary inspection. Because of the numerous columns within this dataset, Principal Component Analysis (PCA) was performed to reduce dimensionality while preserving the information provided. This dataset provides information on 27 different types of lemurs, or taxa, collected by the Duke Lemur Center (DLC). With 82,609 rows and 54 columns, this dataset contains vast information regarding the lives of roughly 3,060 lemurs of each taxon.

While PCA was performed on the entire dataset, the main columns of interest within this research document are the following numerical columns: age_at_death_y, age_at_wt_mo_no_dec, age_at_wt_y, age_max_live_or_dead_y, avg_daily_wt_change_g, birth_month, change_since_prev_wt_g, concep_month, dam_age_at_concep_y, days_before_death, days_since_prev_wt, expected gestation, month_of_weight, litter size, n_known_offspring, r_min_dam_age_at_concep_y, sire_at_concep_y, and wight_g. The non-numerical columns of interest are sex, hybrid, age_category, and taxon. These columns give insights on biological and physical features of each lemur. Performing PCA on this dataset will allow for an increase in readability and intrepretability while preserving as much information as possible.

Details on the specific columns are shown below:

age_at_death_y: the age of the lemur at the time of death

age_at_wt_mo_no_dec: the age of the lemur at the time ‘weight_g’ was recorded rounded down

age_at_wt_y: the age of the lemur at the time ‘weight_g’ was recorded in years

age_max_live_or_dead_y: the maximum age the animal could have achieved as of the date the datafile was last updated

avg_daily_wt_change_g: the average change of weight for a lemur each day

birth_month: the month of birth of the lemur

change_since_prev_wt_g: the weight change since previous recorded weight in grams

concep_month: the month of conception of the lemur

dam_age_at_concep_y: the mother of the lemurs age at conception

days_before_death: the number of days the weight of the lemur was recorded before death

days_since_prev_wt: the number of days the initial weight of the lemur was recorded since new weight was recorded

expected gestation: the expected gestation of the lemur

litter size: the size of the little of the lemur

month_of_weight: the month that the lemurs weight was recorded

n_known_offspring: the number of known offspring of the lemur

r_min_dam_age_at_concep_y: the dam minimum age at conception

sire_at_concep_y: the father of the lemurs age at conception

wight_g: wight of the lemur in grams

The dataset analyzed within this research document, lemurs, is below:

lemurs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv')
head(lemurs)

## # A tibble: 6 × 54
##   taxon dlc_id hybrid sex   name      curre…¹ stud_…² dob        birth…³ estim…⁴
##   <chr> <chr>  <chr>  <chr> <chr>     <chr>   <chr>   <date>       <dbl> <chr>  
## 1 OGG   0005   N      M     KANGA     N       <NA>    1961-08-25       8 <NA>   
## 2 OGG   0005   N      M     KANGA     N       <NA>    1961-08-25       8 <NA>   
## 3 OGG   0006   N      F     ROO       N       <NA>    1961-03-17       3 <NA>   
## 4 OGG   0006   N      F     ROO       N       <NA>    1961-03-17       3 <NA>   
## 5 OGG   0009   N      M     POOH BEAR N       <NA>    1963-09-30       9 <NA>   
## 6 OGG   0009   N      M     POOH BEAR N       <NA>    1963-09-30       9 <NA>   
## # … with 44 more variables: birth_type <chr>, birth_institution <chr>,
## #   litter_size <dbl>, expected_gestation <dbl>, estimated_concep <date>,
## #   concep_month <dbl>, dam_id <chr>, dam_name <chr>, dam_taxon <chr>,
## #   dam_dob <date>, dam_age_at_concep_y <dbl>, sire_id <chr>, sire_name <chr>,
## #   sire_taxon <chr>, sire_dob <date>, sire_age_at_concep_y <dbl>, dod <date>,
## #   age_at_death_y <dbl>, age_of_living_y <dbl>, age_last_verified_y <dbl>,
## #   age_max_live_or_dead_y <dbl>, n_known_offspring <dbl>, …

More information about the dataset can be found here: https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-08-24 and https://www.nature.com/articles/sdata201419.

Below shows the manipulation/engineering to the dataset:

#lemurs[mapply(is.infinite, lemurs)] <- NA #changing the infinite values to NA/NaN
lemur <- lemurs %>%
  select(where(is.numeric) | c("sex","taxon","hybrid","age_category")) %>%
  select(-c("age_at_wt_wk","age_at_wt_d","age_at_wt_mo","age_of_living_y","age_last_verified_y","expected_gestation_d","days_before_inf_birth_if_preg","pct_preg_remain_if_preg","infant_lit_sz_if_preg")) %>%
  select_if(~ !all(is.na(.))) #selecting those that are not all NA/NaN

head(lemur)

## # A tibble: 6 × 22
##   birth_month litter_s…¹ expec…² conce…³ dam_a…⁴ sire_…⁵ age_a…⁶ age_m…⁷ n_kno…⁸
##         <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1           8          1     129       4    2.22    2.22    15.5    15.5       7
## 2           8          1     129       4    2.22    2.22    15.5    15.5       7
## 3           3          1     129      11    1.78    1.78    13.6    13.6       9
## 4           3          1     129      11    1.78    1.78    13.6    13.6       9
## 5           9          1     129       5    4.32    4.32    10.4    10.4       1
## 6           9          1     129       5    4.32    4.32    10.4    10.4       1
## # … with 13 more variables: weight_g <dbl>, month_of_weight <dbl>,
## #   age_at_wt_mo_no_dec <dbl>, age_at_wt_y <dbl>, change_since_prev_wt_g <dbl>,
## #   days_since_prev_wt <dbl>, avg_daily_wt_change_g <dbl>,
## #   days_before_death <dbl>, r_min_dam_age_at_concep_y <dbl>, sex <chr>,
## #   taxon <chr>, hybrid <chr>, age_category <chr>, and abbreviated variable
## #   names ¹litter_size, ²expected_gestation, ³concep_month,
## #   ⁴dam_age_at_concep_y, ⁵sire_age_at_concep_y, ⁶age_at_death_y, …

Approach:

To perform PCA on a dataset, pre-processing should take place. Because the prcomp() function only works with numerical values, a few columns were removed, this was done with the select(where(is.numeric)) function.

Steps for PCA

1. Standardization: Standardization(X) = (X(i) - mean(X)) / standard deviation(X)

This entails transforming the data to ensure that each variable within the dataset contributes equally to the analysis. Standardization ensures that there are no biases toward specific variables.

2. Covariance Matrix: Covariance(X) = E [(X − E[X]) (X − E[X]).T]

Where X is the vector, E represents the expectation of the vector X, and .T depicts the transformation of a vector.

This step is essential in determining if there are any relationships between the variables. The correlation matrix is of square size of the length of the datasets numerical columns and each value corresponds to the relationship between the associated variables.

3. Identify Principal Components

The principal components are determined by computing the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors determine the direction and eigenvalues are the magnitude of variance each initial variable has with each principal component. The principal components are constructed as mixtures of the initial variables. The components act as new variables that the initial variables are uncorrelated with (i.e., perpendicular to). Finding the principal components of a dataset allows for better interpretation of the data and the relationships that may exist within.

4. Feature Vectors and Recasting

Feature vectors are necessary for dimensionality reduction. In this step, the eigenvectors and eigenvalues are taken and sorted from high to low significance. The eigenvalues that are of low magnitude have lower significance and can be discarded from the analysis. Recasting just entails graphing the initial variables onto the principal components shown below in Figure 3.

Below shows the computation of the principal components:

lemur_pca <- na.omit(lemur) %>%
  select(where(is.numeric)) %>% # retain only numeric columns
  scale() %>%                   # scale to zero mean and unit variance
  prcomp()                      # performing principal component analysis

Below shows a summary of the principal component analysis performed:

summary(lemur_pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9345 1.7815 1.3207 1.16583 1.12656 1.03264 0.99839
## Proportion of Variance 0.2079 0.1763 0.0969 0.07551 0.07051 0.05924 0.05538
## Cumulative Proportion  0.2079 0.3842 0.4811 0.55662 0.62712 0.68636 0.74174
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.96297 0.94022 0.82511 0.78650 0.71344 0.65474 0.59451
## Proportion of Variance 0.05152 0.04911 0.03782 0.03437 0.02828 0.02382 0.01964
## Cumulative Proportion  0.79326 0.84237 0.88019 0.91456 0.94284 0.96665 0.98629
##                           PC15     PC16      PC17      PC18
## Standard deviation     0.49682 0.002492 0.0003523 5.334e-14
## Proportion of Variance 0.01371 0.000000 0.0000000 0.000e+00
## Cumulative Proportion  1.00000 1.000000 1.0000000 1.000e+00

Analysis:

The below graphs are the initial variables graphed against the first and second principal components, colored by the sex, age_category, and taxon. The first and second principal components are the two principal components with the highest variance with the initial variables.

arg_lemur <- lemur_pca %>%
  augment(na.omit(lemur))
sex_pca_plot <- ggplot(
  arg_lemur,
  aes(.fittedPC1, .fittedPC2, color = sex)
  ) +
  geom_point(
    alpha = 0.85
    ) +
  ggtitle("Principal Components") +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 1"
    ) +
  scale_color_brewer(
    name = "Sex",
    palette = "Set2", #custom palette
    guide = guide_legend(nrow=2)
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank()
    ) 

age_pca_plot <- ggplot(
  arg_lemur,
  aes(.fittedPC1, .fittedPC2, color = age_category)
  ) +
  geom_point(
    alpha = 0.85
    ) +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2"
    ) +
  scale_color_brewer(
    name = "Age Category",
    labels = c("Adult","Infant or Juvinile","Young Adult"),
    palette = "Accent", #custom palette
    guide = guide_legend(nrow=3)
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank()
    ) 

sex_pca_plot | age_pca_plot

arg_lemur <- lemur_pca %>%
  augment(na.omit(lemur))
taxon_pca_plot1 <- ggplot(
  arg_lemur,
  aes(.fittedPC1, .fittedPC2,color = taxon) #assigning color as the sex
  ) +
  geom_point(
    alpha = .85
    ) +
  ggtitle("Principal Components") +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 2",
    ) +
  scale_color_manual(
    name = "",
    values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
    guide = guide_legend(nrow=14)
    ) +
  #scale_color_viridis(discrete = TRUE) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "none",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank()
    ) 

taxon_pca_plot2 <- ggplot(
  arg_lemur,
  aes(.fittedPC8, .fittedPC9,color = taxon) #assigning color as the sex
  ) +
  geom_point(
    alpha = .85
    ) +
  labs( #adding labels
    x = "Principal Component 8",
    y = "Principal Component 9",
    ) +
  scale_color_manual(
    name = "Taxon",
    values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
    guide = guide_legend(nrow=14)
    ) +
  #scale_color_viridis(discrete = TRUE) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "right",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    legend.text=element_text(size=8),
    legend.spacing.x = unit(0.0, 'cm')
    ) 

taxon_pca_plot1 | taxon_pca_plot2

The graph below shows the initial variables graphed against the first and second principal components. This time the variables are graphed by their eigenvectors and eigenvalues.

arrow_style <- arrow(
  angle = 20, length = grid::unit(6, "pt"),
  ends = "first", type = "closed"
) 
lemur_pca %>%
  # extract rotation matrix
  tidy(matrix = "rotation") %>%
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) %>%
  ggplot(aes(PC1, PC2,color = factor(column))) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style, 
    size = 0.9, alpha = 0.9
  ) +
  ggtitle("Principal Components") +
  labs( #adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Variables",
    subtitle = "Figure 3"
    ) +
  scale_color_manual(
    name = "",
    values = c("#a3df5a","#5a124b", "#d185f2", "#ada6e1","#8cfaf7","#07d4c0","#00f945","#f1f653","#ee8d00","#ff7c5b","#ca4732","#f886c0","#767293","#3f6a69","#055049","#039f2f","#8a8d34","#8d5504","#8e4836","#71291e","#ff32d3","#e098ff","#d8d3ff","#99fffd","#97ffb4","#1f73f5","#f9947a","#ff2200") , #custom palette
    guide = guide_legend(ncol=1)
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "right",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    legend.text=element_text(size=7),
    legend.spacing.y = unit(0.0, 'cm')
    )

Below shows a bar graph of the variance that each principal component has with the initial variables.

lemur_pca %>%
  # extract eigenvalues
  tidy(matrix = "eigenvalues") %>%
  ggplot(aes(PC, percent, fill = "")) + 
  geom_col() +
  geom_point() +
  geom_line() +
  scale_x_continuous(
    breaks = 1:18 # create one axis tick per PC
  ) +
  scale_y_continuous(
    # format y axis ticks as percent values
    label = scales::label_percent(accuracy = 1)
  ) +
  ggtitle("Variance Explained by each Principal Component") +
  labs( #adding labels
    x = "Principal Component",
    y = "Variance Explained",
    subtitle = "Figure 4"
    ) +
  scale_fill_brewer(
    palette = "Paired" #custom palette
    ) +
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = 'none',
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank()
    )

Discussion:

Principal component analysis is an essential part of machine learning and data science because it allows for better interpretation of data. This analysis allows for dimensionality reduction without the loss of valuable information that the dataset may contain. Generally, PCA is used for identifying relationships in datasets where variables are strongly correlated. This analysis showed that the variables within this data are not strongly correlated.

The analysis done on the lemurs dataset is shows various relationships. A depiction of the initial variables plotted against the first two principal components is shown in Figure 1. The graphs colored by both sex and age_category are shown for grouping visualizations purposes. Figure 2 shows how when plotting against two principal components that have high variance such as the graph on the left (PC1 vs. PC2), the groups are profound. Whereas in the right plot (PC8 vs. PC9) the groups are less extreme because the variance of the principal components with the initial variables are lower.

In Figure 3, age_at_wt_y and age_max_live_or_dead_y show to have a strong positive relationship with Principal Component 1 and expected_gestation has a moderate negative relationship with Principal Component 1. This component could be a potential measure for time or number of years. The variables expected_gestation,r_min_dam_age_at_concep_y, dam_age_at_concep_y, sire_age_at_concep_y, and weight_g all have a strong negative relationship with Principal Component 2, whereas avg_daily_wt_change_g and litter_size tend to have a strong positive relationship with Principal Component 2. This component could measure fertility factors. The days_since_prev_wt column seems to have a minimal relationship with Principal Component 1 and Principal Component 2.

Finally, Figure 4 portrays the variance each principal component has with the initial variables. The higher variance in a principal component corresponds to higher dispersion with the initial variables, thus containing more information to draw from. Other cases for PCA include image processing and many classification tasks.

Disclaimer: A number of variables were removed in this analysis for simplicity. These columns include: age_at_wt_wk, age_at_wt_d, age_at_wt_mo, age_of_living_y, age_last_verified_y, expected_gestation_d, days_before_inf_birth_if_preg, pct_preg_remain_if_preg, infant_lit_sz_if_preg. This does not compromise the validity of the findings.

Principal Component Analysis

Lemurs at Duke Lemur Center

Ashlyn Bain